In R, there are many functions/packages to read all kinds of datasets. Some examples are:
read.table, read.csv, for reading tabular datareadLines, for reading lines of a text fileThere are also many ways for wrinting data files
write.table, for writing tabular data to text files (i.e. CSV) or connectionswriteLines, for writing character data line-by-line to a file or connectionCSV stands for Comma Separated Values.
read.tableFor small to moderately sized datasets, you can usually call read.table without specifying any other arguments
Let’s use read.table() to read the text file Employee Sales
First, let’s locate our current working Directory
getwd()
Now, let’s display the list of files present in that directory
list.files()
## [1] "03_meeting_introduction_to_R_p2.html"
## [2] "03_meeting_introduction_to_R_p2.Rmd"
## [3] "Employees.csv"
## [4] "EmployeeSales.txt"
employee_sales <- read.table("EmployeeSales.txt", header=TRUE, sep=",")
is.data.frame(employee_sales)
## [1] TRUE
employee_sales
## Employee_ID FirstName LastName Education Occupation
## 1 1 John Yang Bachelors Professional
## 2 2 Rob Johnson Bachelors Management
## 3 3 Ruben Torres Partial College Skilled Manual
## 4 4 Christy Zhu Bachelors Professional
## 5 5 Rob Huang High School Skilled Manual
## 6 6 John Ruiz Bachelors Professional
## 7 7 John Miller Masters Degree Management
## 8 8 Christy Mehta Partial High School Clerical
## 9 9 Rob Verhoff Partial High School Clerical
## 10 10 Christy Carlson Graduate Degree Management
## 11 11 Gail Erickson Education Professional
## 12 12 Barry Johnson Education Management
## 13 13 Peter Krebs Graduate Degree Clerical
## 14 14 Greg Alderson Partial High School Clerical
## YearlyIncome Sales
## 1 90000 3578.2700
## 2 80000 3399.9900
## 3 50000 699.0982
## 4 80000 3078.2700
## 5 60000 2319.9900
## 6 70000 539.9900
## 7 80000 2320.4900
## 8 50000 24.9900
## 9 45000 24.9900
## 10 70000 2234.9900
## 11 90000 4319.9900
## 12 80000 4968.5900
## 13 50000 59.5300
## 14 45000 23.5000
ncol(employee_sales)
## [1] 7
nrow(employee_sales)
## [1] 14
dim(employee_sales)
## [1] 14 7
In this case, R will automatically
Telling R all these things directly makes R run faster and more efficiently. The read.csv() function is identical to read.table except that some of the defaults are set differently (like the sep argument).
Reading the Employees.csv file using read.csv().
employee <- read.csv("Employees.csv", TRUE, sep = ",")
employee
## FirstName LastName Education Occupation YearlyIncome
## 1 John Yang Bachelors Professional 90000
## 2 Rob Johnson Bachelors Management 80000
## 3 Ruben Torres Partial College Skilled Manual 50000
## 4 Christy Zhu Bachelors Professional 80000
## 5 Rob Huang High School Skilled Manual 60000
## 6 John Ruiz Bachelors Professional 70000
## 7 John Miller Masters Degree Management 80000
## 8 Christy Mehta Partial High School Clerical 50000
## 9 Rob Verhoff Partial High School Clerical 45000
## 10 Christy Carlson Graduate Degree Management 70000
## 11 Gail Erickson Education Professional 90000
## 12 Barry Johnson Education Management 80000
## 13 Peter Krebs Graduate Degree Clerical 50000
## 14 Greg Alderson Partial High School Clerical 45000
## HireDate
## 1 2018-01-06
## 2 2009-12-10
## 3 2009-12-11
## 4 2008-12-12
## 5 2012-09-08
## 6 2006-07-09
## 7 2012-08-09
## 8 2005-07-07
## 9 2015-09-13
## 10 2015-01-14
## 11 2002-10-06
## 12 2015-05-14
## 13 2014-01-13
## 14 2005-07-13
str(employee)
## 'data.frame': 14 obs. of 6 variables:
## $ FirstName : Factor w/ 8 levels "Barry","Christy",..: 5 7 8 2 7 5 5 2 7 2 ...
## $ LastName : Factor w/ 13 levels "Alderson","Carlson",..: 12 5 10 13 4 9 8 7 11 2 ...
## $ Education : Factor w/ 7 levels "Bachelors","Education",..: 1 1 6 1 4 1 5 7 7 3 ...
## $ Occupation : Factor w/ 4 levels "Clerical","Management",..: 3 2 4 3 4 3 2 1 1 2 ...
## $ YearlyIncome: int 90000 80000 50000 80000 60000 70000 80000 50000 45000 70000 ...
## $ HireDate : Factor w/ 14 levels "2002-10-06","2005-07-07",..: 14 6 7 5 9 4 8 2 13 11 ...
Now that we know how to read table and csv files from the package utils (one of the several built in packages that comes when you install R), we will learn how to read them using much faster functions that are able to deal with larger flat files and quickly.
We will need to load the package readr
library(readr)
## Warning: package 'readr' was built under R version 3.5.2
The package readr supports seven file formats with seven read_ functions:
read_csv(): comma separated (CSV) filesread_tsv(): tab separated filesread_delim(): general delimited filesread_fwf(): fixed width filesread_table(): tabular files where columns are separated by white-space.read_log(): web log filesemployee_sales2 <- read_delim("EmployeeSales.txt", delim = ",")
## Parsed with column specification:
## cols(
## Employee_ID = col_double(),
## FirstName = col_character(),
## LastName = col_character(),
## Education = col_character(),
## Occupation = col_character(),
## YearlyIncome = col_double(),
## Sales = col_double()
## )
employee_sales2
## # A tibble: 14 x 7
## Employee_ID FirstName LastName Education Occupation YearlyIncome Sales
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 John Yang Bachelors Professio~ 90000 3578.
## 2 2 Rob Johnson Bachelors Management 80000 3400.
## 3 3 Ruben Torres Partial C~ Skilled M~ 50000 699.
## 4 4 Christy Zhu Bachelors Professio~ 80000 3078.
## 5 5 Rob Huang High Scho~ Skilled M~ 60000 2320.
## 6 6 John Ruiz Bachelors Professio~ 70000 540.
## 7 7 John Miller Masters D~ Management 80000 2320.
## 8 8 Christy Mehta Partial H~ Clerical 50000 25.0
## 9 9 Rob Verhoff Partial H~ Clerical 45000 25.0
## 10 10 Christy Carlson Graduate ~ Management 70000 2235.
## 11 11 Gail Erickson Education Professio~ 90000 4320.
## 12 12 Barry Johnson Education Management 80000 4969.
## 13 13 Peter Krebs Graduate ~ Clerical 50000 59.5
## 14 14 Greg Alderson Partial H~ Clerical 45000 23.5
You also can read a csv file using read_csv()
employee2 <- read_csv("Employees.csv", TRUE)
## Parsed with column specification:
## cols(
## FirstName = col_character(),
## LastName = col_character(),
## Education = col_character(),
## Occupation = col_character(),
## YearlyIncome = col_double(),
## HireDate = col_date(format = "")
## )
employee2
## # A tibble: 14 x 6
## FirstName LastName Education Occupation YearlyIncome HireDate
## <chr> <chr> <chr> <chr> <dbl> <date>
## 1 John Yang Bachelors Professional 90000 2018-01-06
## 2 Rob Johnson Bachelors Management 80000 2009-12-10
## 3 Ruben Torres Partial College Skilled Man~ 50000 2009-12-11
## 4 Christy Zhu Bachelors Professional 80000 2008-12-12
## 5 Rob Huang High School Skilled Man~ 60000 2012-09-08
## 6 John Ruiz Bachelors Professional 70000 2006-07-09
## 7 John Miller Masters Degree Management 80000 2012-08-09
## 8 Christy Mehta Partial High Sc~ Clerical 50000 2005-07-07
## 9 Rob Verhoff Partial High Sc~ Clerical 45000 2015-09-13
## 10 Christy Carlson Graduate Degree Management 70000 2015-01-14
## 11 Gail Erickson Education Professional 90000 2002-10-06
## 12 Barry Johnson Education Management 80000 2015-05-14
## 13 Peter Krebs Graduate Degree Clerical 50000 2014-01-13
## 14 Greg Alderson Partial High Sc~ Clerical 45000 2005-07-13
To get other types of data into R, we recommend starting with the tidyverse packages listed below. They’re certainly not perfect, but they are a good place to start. For rectangular data:
haven reads SPSS, Stata, and SAS files.
readxl reads excel files (both .xls and .xlsx).
DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.
To Subset vectors we use the [ operator.
x <- c("a", "b", "c", "c", "d", "a")
x[1] ## Extract the first element
## [1] "a"
x[2] ## Extract the second element
## [1] "b"
The [ operator can be used to extract multiple elements of a vector by passing the operator an integer sequence.
x[1:4]
## [1] "a" "b" "c" "c"
The sequence does not have to be in order; you can specify any arbitrary integer vector.
x[c(1, 3, 4)]
## [1] "a" "c" "c"
We can also pass a logical sequence to the [ operator to extract elements of a vector that satisfy a given condition.
x[x > "a"]
## [1] "b" "c" "c" "d"
x <- matrix(1:6, 2, 3)
x
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
x[1, 2]
## [1] 3
x[1, ] # extract row 1
## [1] 1 3 5
x <- list(foo = 1:4, bar = 0.6, baz = "hello")
x
## $foo
## [1] 1 2 3 4
##
## $bar
## [1] 0.6
##
## $baz
## [1] "hello"
The [[ operator can be used to extract single elements from a list. Here we extract the first element of the list.
x[[1]]
## [1] 1 2 3 4
x[["bar"]] # extract element by named indices
## [1] 0.6
x$bar # extract elements by name
## [1] 0.6
x[c(1, 3)] # extract the first and third elements of a list
## $foo
## [1] 1 2 3 4
##
## $baz
## [1] "hello"
employee[,c(1,2,4)] # extract column 1, 2, and 4
## FirstName LastName Occupation
## 1 John Yang Professional
## 2 Rob Johnson Management
## 3 Ruben Torres Skilled Manual
## 4 Christy Zhu Professional
## 5 Rob Huang Skilled Manual
## 6 John Ruiz Professional
## 7 John Miller Management
## 8 Christy Mehta Clerical
## 9 Rob Verhoff Clerical
## 10 Christy Carlson Management
## 11 Gail Erickson Professional
## 12 Barry Johnson Management
## 13 Peter Krebs Clerical
## 14 Greg Alderson Clerical
employee[employee$Education == "Bachelors",] # subset only the professionals with Bachelor degree
## FirstName LastName Education Occupation YearlyIncome HireDate
## 1 John Yang Bachelors Professional 90000 2018-01-06
## 2 Rob Johnson Bachelors Management 80000 2009-12-10
## 4 Christy Zhu Bachelors Professional 80000 2008-12-12
## 6 John Ruiz Bachelors Professional 70000 2006-07-09
x <- c(1, 2, NA, 4, NA, 5)
x
## [1] 1 2 NA 4 NA 5
bad <- is.na(x)
bad
## [1] FALSE FALSE TRUE FALSE TRUE FALSE
x[!bad]
## [1] 1 2 4 5
The dataset airquality is available in R
?airquality
head(airquality,15)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## 11 7 NA 6.9 74 5 11
## 12 16 256 9.7 69 5 12
## 13 11 290 9.2 66 5 13
## 14 14 274 10.9 68 5 14
## 15 18 65 13.2 58 5 15
good <- complete.cases(airquality)
head(airquality[good, ])
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
Exercise 1: Extract from employee dataset the employees with YearlyIncome > 70000
Times are represented by the POSIXct or the POSIXlt class
date_ex <- c("1990-02-05")
class(date_ex)
## [1] "character"
date_ex <- as.Date(date_ex)
class(date_ex)
## [1] "Date"
You can extract pieces of dates and/or times.
weekdays: give the day of the weekmonths: give the month namequarters: give the quarter number (“Q1”, “Q2”, “Q3”, or “Q4”)weekdays(date_ex)
## [1] "Monday"
months(date_ex)
## [1] "February"
quarters(date_ex)
## [1] "Q1"
You can get Current Date and Time using the following functions.
Sys.time()
Sys.Date()
Exercise 2: Extract weekday, month, quarter, and day of the week from today’s date.
We can use mathematical operations on dates and times. We can do + and - and comparisons too (i.e. ==, <=)
x <- as.Date("2018-03-04")
y <- as.Date("2018-04-05")
x+7 # Add 7 days to the date
## [1] "2018-03-11"
x-y
## Time difference of -32 days
x > y # logical statement
## [1] FALSE
Exercise 3: Extract the weekdays and months of the HireDate in the employee dataset.